First look at the data:
str(houses)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
## $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
## $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
## $ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
## $ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
## $ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
dim(houses)
## [1] 1460 81
head(houses)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
houses$MSSubClass <- as.factor(houses$MSSubClass) # numeric codes for the dwelling type, so treat as categorical
Let us look at the missing data:
miss <- colSums(is.na(houses)) # number of NAs per column
miss[miss > 0.5 * nrow(houses)] # more than 50% of data is NA
## Alley PoolQC Fence MiscFeature
## 1369 1453 1179 1406
In all four cases, however, the NAs are not genuinely missing values: they encode the ‘no’/‘none’ category of the variable (no alley access, no pool, no fence, no miscellaneous feature). In the analysis, consider collapsing each of these variables into a binary yes/no indicator.
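The recoding suggested above could look roughly like the following sketch. The data frame `toy`, the helper `na_to_none()` and the column `HasPool` are hypothetical names introduced only for illustration, and the values are made up; the real `houses` data would be treated the same way.

```r
# Toy stand-in for the real data; values are made up for illustration only.
toy <- data.frame(
  Alley  = factor(c("Grvl", NA, NA, "Pave", NA)),
  PoolQC = factor(c(NA, NA, "Ex", NA, NA))
)

# Hypothetical helper: turn NA into an explicit "None" level.
na_to_none <- function(x, label = "None") {
  x <- as.character(x)
  x[is.na(x)] <- label
  factor(x)
}

# Option 1: keep the original categories and add "None".
toy$Alley <- na_to_none(toy$Alley)

# Option 2: collapse to a binary yes/no indicator.
toy$HasPool <- factor(ifelse(is.na(toy$PoolQC), "no", "yes"),
                      levels = c("no", "yes"))

table(toy$Alley)
table(toy$HasPool)
```

The first option keeps the information in the original levels, while the second is simpler when nearly all observations fall into the "none" group, as is the case for PoolQC.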
The analysis of continuous (numerical) variables
num.vars <- Filter(is.numeric, houses) %>% names() # 37
num.vars <- num.vars[-1] # delete id
summary(houses[num.vars])
## LotFrontage LotArea OverallQual OverallCond
## Min. : 21.00 Min. : 1300 Min. : 1.000 Min. :1.000
## 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000 1st Qu.:5.000
## Median : 69.00 Median : 9478 Median : 6.000 Median :5.000
## Mean : 70.05 Mean : 10517 Mean : 6.099 Mean :5.575
## 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000 3rd Qu.:6.000
## Max. :313.00 Max. :215245 Max. :10.000 Max. :9.000
## NA's :259
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## Min. :1872 Min. :1950 Min. : 0.0 Min. : 0.0
## 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0 1st Qu.: 0.0
## Median :1973 Median :1994 Median : 0.0 Median : 383.5
## Mean :1971 Mean :1985 Mean : 103.7 Mean : 443.6
## 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0 3rd Qu.: 712.2
## Max. :2010 Max. :2010 Max. :1600.0 Max. :5644.0
## NA's :8
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 334
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 1st Qu.: 882
## Median : 0.00 Median : 477.5 Median : 991.5 Median :1087
## Mean : 46.55 Mean : 567.2 Mean :1057.4 Mean :1163
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2 3rd Qu.:1391
## Max. :1474.00 Max. :2336.0 Max. :6110.0 Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt
## Min. :0.000 Min. : 2.000 Min. :0.000 Min. :1900
## 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000 1st Qu.:1961
## Median :1.000 Median : 6.000 Median :1.000 Median :1980
## Mean :1.047 Mean : 6.518 Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :14.000 Max. :3.000 Max. :2010
## NA's :81
## GarageCars GarageArea WoodDeckSF OpenPorchSF
## Min. :0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00 1st Qu.: 0.00
## Median :2.000 Median : 480.0 Median : 0.00 Median : 25.00
## Mean :1.767 Mean : 473.0 Mean : 94.24 Mean : 46.66
## 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :4.000 Max. :1418.0 Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SalePrice
## Min. : 0.00 Min. : 1.000 Min. :2006 Min. : 34900
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 1st Qu.:129975
## Median : 0.00 Median : 6.000 Median :2008 Median :163000
## Mean : 43.49 Mean : 6.322 Mean :2008 Mean :180921
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 3rd Qu.:214000
## Max. :15500.00 Max. :12.000 Max. :2010 Max. :755000
##
We see that some variables take only a few distinct values (e.g. BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, KitchenAbvGr, Fireplaces, MoSold). In further analysis, these variables may be treated as factors.
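One possible way to find such variables and convert them is sketched below. The data frame `toy` holds the first six values of Fireplaces and GrLivArea shown in the `head()` output above. The cut-off `max.levels` is an arbitrary assumption, not a value used in the original analysis.

```r
# First six rows of two numeric columns, copied from the head() output.
toy <- data.frame(
  Fireplaces = c(0L, 1L, 1L, 1L, 1L, 0L),
  GrLivArea  = c(1710L, 1262L, 1786L, 1717L, 2198L, 1362L)
)

# Count the distinct non-missing values in each column.
n.unique <- sapply(toy, function(x) length(unique(x[!is.na(x)])))

# Assumed cut-off: columns with at most this many distinct values become factors.
max.levels <- 5
few.vals <- names(n.unique)[n.unique <= max.levels]

# Convert only the selected columns; the others stay numeric.
toy[few.vals] <- lapply(toy[few.vals], factor)
str(toy)
```

On the full data a larger cut-off would be needed to catch MoSold, which takes twelve values.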
Let us now look at the correlation structure of numerical variables.
corr <- cor(houses[num.vars]) %>% round(2)
ggcorrplot(corr, type = 'lower')
We see high correlation between variables that are obviously linked, e.g. GarageCars and GarageArea, YearBuilt and YearRemodAdd, etc.
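Besides reading the plot, the strongly correlated pairs can be listed directly from the correlation matrix. The sketch below uses the first ten values of three columns taken from the `str()` output above. The name `high.pairs` and the threshold of 0.7 are assumptions made for illustration.

```r
# First ten rows of three numeric columns, copied from the str() output.
toy <- data.frame(
  GarageCars = c(2, 2, 2, 3, 3, 2, 2, 2, 2, 1),
  GarageArea = c(548, 460, 608, 642, 836, 480, 636, 484, 468, 205),
  YearBuilt  = c(2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 1939)
)

corr <- round(cor(toy), 2)

# Blank out the diagonal and upper triangle so each pair appears only once.
corr[upper.tri(corr, diag = TRUE)] <- NA

# Reshape the matrix into one row per pair of variables.
high.pairs <- as.data.frame(as.table(corr), responseName = "r",
                            stringsAsFactors = FALSE)

# Keep pairs above the assumed threshold, strongest first.
high.pairs <- high.pairs[!is.na(high.pairs$r) & abs(high.pairs$r) > 0.7, ]
high.pairs[order(-abs(high.pairs$r)), ]
```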
The analysis of categorical variables
fac.vars <- Filter(is.factor, houses) %>% names() # 44
summary(houses[fac.vars])
## MSSubClass MSZoning Street Alley LotShape LandContour
## 20 :536 C (all): 10 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 60 :299 FV : 65 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## 50 :144 RH : 16 NA's:1369 IR3: 10 Low: 36
## 120 : 87 RL :1151 Reg:925 Lvl:1311
## 30 : 69 RM : 218
## 160 : 63
## (Other):262
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle RoofStyle RoofMatl
## Norm :1445 1Fam :1220 1Story :726 Flat : 13 CompShg:1434
## Feedr : 6 2fmCon: 31 2Story :445 Gable :1141 Tar&Grv: 11
## Artery : 2 Duplex: 52 1.5Fin :154 Gambrel: 11 WdShngl: 6
## PosN : 2 Twnhs : 43 SLvl : 65 Hip : 286 WdShake: 5
## RRNn : 2 TwnhsE: 114 SFoyer : 37 Mansard: 7 ClyTile: 1
## PosA : 1 1.5Unf : 14 Shed : 2 Membran: 1
## (Other): 2 (Other): 19 (Other): 2
## Exterior1st Exterior2nd MasVnrType ExterQual ExterCond Foundation
## VinylSd:515 VinylSd:504 BrkCmn : 15 Ex: 52 Ex: 3 BrkTil:146
## HdBoard:222 MetalSd:214 BrkFace:445 Fa: 14 Fa: 28 CBlock:634
## MetalSd:220 HdBoard:207 None :864 Gd:488 Gd: 146 PConc :647
## Wd Sdng:206 Wd Sdng:197 Stone :128 TA:906 Po: 1 Slab : 24
## Plywood:108 Plywood:142 NA's : 8 TA:1282 Stone : 6
## CemntBd: 61 CmentBd: 60 Wood : 3
## (Other):128 (Other):136
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Heating
## Ex :121 Fa : 45 Av :221 ALQ :220 ALQ : 19 Floor: 1
## Fa : 35 Gd : 65 Gd :134 BLQ :148 BLQ : 33 GasA :1428
## Gd :618 Po : 2 Mn :114 GLQ :418 GLQ : 14 GasW : 18
## TA :649 TA :1311 No :953 LwQ : 74 LwQ : 46 Grav : 7
## NA's: 37 NA's: 37 NA's: 38 Rec :133 Rec : 54 OthW : 2
## Unf :430 Unf :1256 Wall : 4
## NA's: 37 NA's: 38
## HeatingQC CentralAir Electrical KitchenQual Functional FireplaceQu
## Ex:741 N: 95 FuseA: 94 Ex:100 Maj1: 14 Ex : 24
## Fa: 49 Y:1365 FuseF: 27 Fa: 39 Maj2: 5 Fa : 33
## Gd:241 FuseP: 3 Gd:586 Min1: 31 Gd :380
## Po: 1 Mix : 1 TA:735 Min2: 34 Po : 20
## TA:428 SBrkr:1334 Mod : 15 TA :313
## NA's : 1 Sev : 1 NA's:690
## Typ :1360
## GarageType GarageFinish GarageQual GarageCond PavedDrive PoolQC
## 2Types : 6 Fin :352 Ex : 3 Ex : 2 N: 90 Ex : 2
## Attchd :870 RFn :422 Fa : 48 Fa : 35 P: 30 Fa : 2
## Basment: 19 Unf :605 Gd : 14 Gd : 9 Y:1340 Gd : 3
## BuiltIn: 88 NA's: 81 Po : 3 Po : 7 NA's:1453
## CarPort: 9 TA :1311 TA :1326
## Detchd :387 NA's: 81 NA's: 81
## NA's : 81
## Fence MiscFeature SaleType SaleCondition
## GdPrv: 59 Gar2: 2 WD :1267 Abnorml: 101
## GdWo : 54 Othr: 2 New : 122 AdjLand: 4
## MnPrv: 157 Shed: 49 COD : 43 Alloca : 12
## MnWw : 11 TenC: 1 ConLD : 9 Family : 20
## NA's :1179 NA's:1406 ConLI : 5 Normal :1198
## ConLw : 5 Partial: 125
## (Other): 9
The variable Utilities is uninformative: it has only two categories, and one of them contains a single observation. Condition2 is of little use for a similar reason: 1445 of the 1460 observations fall into Norm, and each remaining category holds only a handful. In further analysis, consider merging categories for some variables, as several levels contain only a few observations (e.g. the quality variables).
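Merging sparse categories could be done along the following lines. The helper `lump_rare()`, its default threshold `min.n = 10` and the label "Other" are assumptions introduced for this sketch. The vector `ExterCond` is rebuilt from the level counts printed in the summary above.

```r
# Rebuild ExterCond from the level counts shown in the summary output.
ExterCond <- factor(rep(c("Ex", "Fa", "Gd", "Po", "TA"),
                        times = c(3, 28, 146, 1, 1282)))

# Hypothetical helper: merge all levels with fewer than min.n observations.
lump_rare <- function(x, min.n = 10, other = "Other") {
  counts <- table(x)
  rare <- names(counts)[counts < min.n]
  x <- as.character(x)
  x[x %in% rare] <- other   # NA entries are left untouched
  factor(x)
}

lumped <- lump_rare(ExterCond)
table(lumped)

# Share of the most frequent level; values close to 1 flag
# near-constant variables such as Utilities.
max(prop.table(table(ExterCond)))
```

For ordered quality scales it may be more sensible to merge neighbouring grades (for example Po with Fa) than to pool everything into one "Other" level.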
Let us look more closely at the target variable SalePrice.
summary(houses$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
This is the histogram and the estimate of its density:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We see that the distribution is right-skewed: the mean (180921) lies above the median (163000), and the maximum is 755000.
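The plotting code is not shown in the report, and the `stat_bin()` message suggests it was drawn with ggplot2. A comparable picture can be sketched in base R as below. The vector `price` is simulated from a log-normal distribution purely for illustration and is not the real SalePrice column.

```r
# Simulated right-skewed prices; NOT the real data.
set.seed(1)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Histogram on the density scale, so a density curve can be overlaid.
h <- hist(price, breaks = 30, freq = FALSE,
          main = "Histogram and density estimate", xlab = "SalePrice")
lines(density(price), lwd = 2)

# A quick numeric check of right skew: the mean exceeds the median.
c(mean = mean(price), median = median(price))
```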
Now, we plot boxplots of sale prices by category for each categorical variable.
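The code for these plots is not included in the report. A minimal base R sketch of the idea follows, using the first ten values of SalePrice, MSZoning and Foundation from the `str()` output above. The loop and the object names are assumptions, not the original implementation.

```r
# First ten rows, decoded from the str() output above.
toy <- data.frame(
  SalePrice  = c(208500, 181500, 223500, 140000, 250000,
                 143000, 307000, 200000, 129900, 118000),
  MSZoning   = factor(c("RL", "RL", "RL", "RL", "RL",
                        "RL", "RL", "RL", "RM", "RL")),
  Foundation = factor(c("PConc", "CBlock", "PConc", "BrkTil", "PConc",
                        "Wood", "PConc", "CBlock", "BrkTil", "BrkTil"))
)

fac.vars <- names(Filter(is.factor, toy))

# One panel per categorical variable.
op <- par(mfrow = c(1, length(fac.vars)))
for (v in fac.vars) {
  boxplot(reformulate(v, response = "SalePrice"), data = toy,
          xlab = v, ylab = "SalePrice")
}
par(op)

# Numeric companion to the plot: median price per category.
med <- tapply(toy$SalePrice, toy$Foundation, median)
med
```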
Now, we look at the correlation of sale prices with the other numerical variables.
(cor.sp <- corr[, "SalePrice"])
## LotFrontage LotArea OverallQual OverallCond YearBuilt
## NA 0.26 0.79 -0.08 0.52
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 0.51 NA 0.39 -0.01 0.21
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## 0.61 0.61 0.32 -0.03 0.71
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 0.23 -0.02 0.56 0.28 0.17
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## -0.14 0.53 0.47 NA 0.64
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0.62 0.32 0.32 -0.13 0.04
## ScreenPorch PoolArea MiscVal MoSold YrSold
## 0.11 0.09 -0.02 0.05 -0.03
## SalePrice
## 1.00
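The NA entries for LotFrontage, MasVnrArea and GarageYrBlt arise because these columns contain missing values and `cor()` by default returns NA for any pair involving them. One option, sketched below on the first ten values of LotFrontage and SalePrice from the `str()` output, is pairwise deletion through the `use` argument. Whether this is appropriate for the full analysis is a judgment call that the report itself does not make.

```r
# First ten rows, copied from the str() output; LotFrontage has one NA.
toy <- data.frame(
  LotFrontage = c(65, 80, 68, 60, 84, 85, 75, NA, 51, 50),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000,
                  143000, 307000, 200000, 129900, 118000)
)

# Default behaviour: any NA in a column makes its correlations NA.
r.default <- cor(toy)["LotFrontage", "SalePrice"]

# Pairwise deletion: each coefficient uses the rows complete for that pair.
r.pair <- cor(toy, use = "pairwise.complete.obs")["LotFrontage", "SalePrice"]

c(default = r.default, pairwise = r.pair)
```

With pairwise deletion, different coefficients in the same matrix may be based on different subsets of rows, which is worth keeping in mind when comparing them.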
These are the variables whose correlation with sale price is higher than 0.5 (SalePrice itself appears with its trivial correlation of 1).
cor.sp.high
## OverallQual YearBuilt YearRemodAdd TotalBsmtSF X1stFlrSF GrLivArea
## 0.79 0.52 0.51 0.61 0.61 0.71
## FullBath TotRmsAbvGrd GarageCars GarageArea SalePrice
## 0.56 0.53 0.64 0.62 1.00
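The line that creates `cor.sp.high` is not shown in the report. One way it could have been obtained is sketched below on a shortened copy of `cor.sp` with values taken from the output above. Using `which()` here is an assumption, chosen because plain logical indexing would keep the NA entries.

```r
# Shortened copy of cor.sp, with values from the output above.
cor.sp <- c(LotFrontage = NA, LotArea = 0.26, OverallQual = 0.79,
            GrLivArea = 0.71, SalePrice = 1.00)

# Plain logical indexing keeps an NA element for LotFrontage.
cor.sp[cor.sp > 0.5]

# which() drops the NA comparisons and returns only the TRUE positions.
cor.sp.high <- cor.sp[which(cor.sp > 0.5)]
cor.sp.high

# Optionally drop the target itself before plotting or modelling.
cor.sp.high[names(cor.sp.high) != "SalePrice"]
```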
Let us plot the highly correlated variables versus the sale prices.
The plots show increasing trends, consistent with the positive correlations above.